gbn: Sender side only packeting resend bug fix #87
Force-pushed from 57e30f5 to 764fe2f.
Updated with docs, and made some small changes to the PR! This is now ready for review 🔥.
I'll address the CI failures tomorrow, but the PR should be reviewable even before that happens.
I've confirmed locally through manual testing that this PR causes no issues if the two parties have different resend timeouts. One thing I'd still like to test is interoperability when the client has updated to the new LNC version but the server hasn't. It would be good to ensure that this doesn't cause more resends by the server than prior to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice investigation!!! 🎉
Left a bunch of comments - basically, I think we can simplify and modularise the logic quite a bit by letting the queue handle everything.
Force-pushed from 8f8ecc9 to 09b83a0.
Thanks a lot for the review @ellemouton!!
I've addressed your feedback, and I totally agree that it makes much more sense to let the queue struct handle the resend catch-up logic instead, so I've moved it there!
Please also check my response here and let me know what you'd like me to do with the catchUpID: #87 (comment)
Force-pushed from 09b83a0 to dd6046e.
The itest that failed in this round of the CI workflow is a pre-existing flake from before this PR: it sometimes passes and sometimes fails.
You are able to trigger a manual re-run, right? Also - can you make an issue for this flake so that some investigation can be done?
I've learned how to trigger a re-run of the workflow now :)! Also opened a new issue for the error. Thanks for the nudge :)!
Check out this patch for how we can do a bit of cleanup: gbn_cleanup.patch
Left a few comments - I'm mainly concerned with adding complexity to the system to solve this issue. I think we should maybe first focus on reducing the complexity and then tackle the issue at hand.
So I'm thinking we should really try to divide up what exactly is being handled by GBN and what is being handled by queue. I think we can make use of a state-machine model too. Currently it seems like the complexity is coming from the fact that both the queue and the GBN can trigger timeouts & resends. I think this should only be handled by one entity. So for example, if we are in the "resend & sync" state, then nothing can trigger a new resend & sync state since we are already in that state.
Let me know what you think.
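For illustration, here's a minimal sketch of that state-machine guard. The state names and the tryStartResend entry point are hypothetical, not the PR's actual code; the point is just that only one entity can ever start a resend & sync round:

```go
package gbn

import "sync"

// syncState models the suggested "resend & sync" state machine.
type syncState int

const (
	syncStateIdle syncState = iota
	syncStateResending
)

type syncer struct {
	mu    sync.Mutex
	state syncState
}

// tryStartResend is the single entry point for triggering a resend:
// if a resend & sync round is already in flight, nothing else can
// start a new one, and the call becomes a no-op.
func (s *syncer) tryStartResend() bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.state == syncStateResending {
		return false
	}
	s.state = syncStateResending

	return true
}
```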
gbn/gbn_conn.go (Outdated)

```diff
@@ -578,7 +591,10 @@ func (g *GoBackNConn) receivePacketsForever() error { // nolint:gocyclo
 		}

 	case *PacketACK:
-		gotValidACK := g.sendQueue.processACK(m.Seq)
+		gotValidACK := g.sendQueue.processACK(
+			m.Seq, g.resendTimeout,
```
If we never change the value of g.resendTimeout, then why do we need to pass it to sendQueue each time we call processACK? It can instead be passed to it on instantiation and be a member of queue, right?
So the problem with just setting the resendTimeout through the constructor is that the queue's resendTimeout then currently won't get set to the value set by the WithTimeout option (though that's easily fixed of course).
I've now fixed that issue in #88 with the addition of the TimeoutManager. If you agree, we'll let this PR still pass the timeout in the different function calls to the queue, and then remove that with #88 instead?
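For reference, a sketch of what the constructor-injection alternative would look like (newQueue and the field layout here are illustrative, not the PR's actual code):

```go
package gbn

import "time"

// queue carries the resend timeout as a member, set once at
// instantiation instead of being passed on every call.
type queue struct {
	resendTimeout time.Duration
}

func newQueue(resendTimeout time.Duration) *queue {
	return &queue{resendTimeout: resendTimeout}
}

// processACK can then drop the timeout parameter entirely.
func (q *queue) processACK(seq uint8) bool {
	// ...use q.resendTimeout wherever the parameter was used.
	_ = q.resendTimeout
	_ = seq

	return true
}
```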
Force-pushed from f449c3a to a351b99.
```go
// As we can't be sure that waitForSync cancel listener was
// triggered before this one, we send over the cancel channel
// again, to make sure that both listeners are triggered.
c.reset()
```
We could potentially cover this instead by closing the cancel channel when executing the reset function, and creating a new channel on every sync init (and also use a sync.Once to ensure we only close the channel once per sync). But I figured this version is probably a bit cleaner.
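A sketch of that close-and-recreate alternative, with hypothetical field names: closing the channel releases every listener at once, the sync.Once guards against a double close, and both are re-initialised per sync round:

```go
package gbn

import "sync"

type syncer struct {
	mu         sync.Mutex
	cancel     chan struct{}
	cancelOnce *sync.Once
}

// initSync must be called at the start of every sync round.
func (s *syncer) initSync() {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.cancel = make(chan struct{})
	s.cancelOnce = new(sync.Once)
}

// reset cancels the current sync round. All goroutines blocked on
// <-s.cancel are released by the close, however many there are.
func (s *syncer) reset() {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.cancelOnce.Do(func() {
		close(s.cancel)
	})
}
```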
sync.Once to ensure we only close the channel once per sync

But then you would need to re-init the sync.Once each time too, which kinda defeats the purpose, right?
I think the state machine approach is a bit nicer than closing & recreating channels.
If you want to make sure that the reset only does something if we actually need it to, then, in the reset method, you could check that we are in the "Resending" state; if we are not, it becomes a no-op. That way, the state machine becomes a bit safer. We could add the same type of check anywhere else where we change the state.
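Continuing the hypothetical state-machine sketch from the earlier comment, the guarded reset would look something like this:

```go
// reset only acts when a resend round is actually in flight; in any
// other state it is a no-op, so stray or repeated resets are harmless.
func (s *syncer) reset() {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.state != syncStateResending {
		return
	}
	s.state = syncStateIdle
}
```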
But then you would need to re-init the sync.Once each time too, which kinda defeats the purpose, right?

Yes exactly, which is why I opted for this solution :)!

If you want to make sure that the reset only does something if we actually need it to, then, in the reset method, you could check that we are in the "Resending" state; if we are not, it becomes a no-op. That way, the state machine becomes a bit safer. We could add the same type of check anywhere else where we change the state.

Hmmm, I'm not sure how this would remove the need to call c.reset() both here and at the end of the waitForSync function, as we can't be sure which of the select cases - here or in waitForSync - will catch the <-c.cancel case first. We need to ensure that both trigger when the c.cancel channel gets sent over. Is this a case where you'd use a channel with a size of 2, or is there a better way to solve this?
It sounds to me like this could instead be handled by a sync.Cond. But IMO that is harder to understand (and also not too easy to use in the first place). So I leave it to you if you want to refactor or leave it the way it is now.
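For completeness, a sketch of how sync.Cond would cover this (the type and names here are illustrative): a single Broadcast wakes every waiter, so there is no need to send on a cancel channel once per listener:

```go
package gbn

import "sync"

type cancelSignal struct {
	mu       sync.Mutex
	cond     *sync.Cond
	canceled bool
}

func newCancelSignal() *cancelSignal {
	s := &cancelSignal{}
	s.cond = sync.NewCond(&s.mu)

	return s
}

// wait blocks until cancel has been signalled. Both the resend
// listener and waitForSync could wait on this without racing over
// who consumes the signal first.
func (s *cancelSignal) wait() {
	s.mu.Lock()
	defer s.mu.Unlock()

	for !s.canceled {
		s.cond.Wait()
	}
}

// cancel wakes all current waiters exactly once.
func (s *cancelSignal) cancel() {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.canceled = true
	s.cond.Broadcast()
}
```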
Awesome!
Yeah - the ideal PR structure would be: 1) add the new syncer + test, then 2) inject it & use it in queue. Otherwise we sorta have to review the logic twice. (Don't worry about keeping me as commit author for the syncer refactor - it was just putting your code in another file anyways 😊)
Leaving some comments :)
I think your idea of first adding the syncer & then injecting the usage is a good one and will make reviewing easier, since we then only need to review the logic once vs 2 versions of similar logic.
One thing that concerns me is that it seems (from looking at the results of some of the latest GH action runs) that with this PR, the runtime of the reconnection test has strictly increased, and so it is resulting in more timeouts.
before PR
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (14.81s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.78s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (16.00s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (11.80s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (19.80s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (21.29s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (19.86s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (13.81s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (10.95s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (14.66s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (15.82s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (19.80s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (22.18s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (18.52s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (13.84s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.71s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (15.70s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (21.82s)
--- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (33.75s)
with PR
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (28.67s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (24.85s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (16.08s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (27.85s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (41.81s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (23.67s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (55.31s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (29.89s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (29.85s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.09s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (29.47s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (38.83s)
--- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (35.50s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (14.83s)
--- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (33.87s)
--- FAIL: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (33.06s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (31.83s)
--- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (39.80s)
--- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (32.93s)
Is this perhaps a necessary/expected increase due to the fact that we now back off a bit?
gbn/queue.go (Outdated)

```diff
@@ -101,8 +118,6 @@ func (q *queue) resend(cb func(packet *PacketData) error) error {
 		return nil
 	}

-	q.lastResend = time.Now()
```
Where are we setting it now though?
Ah good catch, my bad!
gbn/queue.go (Outdated)

```diff
@@ -112,36 +127,48 @@ func (q *queue) resend(cb func(packet *PacketData) error) error {
 	q.topMtx.RUnlock()

 	if base == top {
 		q.syncer.reset()
```
Why do we need a reset here?
Initially in my diff I had it here because there was an init state above this, but if that is removed then this can be removed too, right?
Btw - the reason I had the other "init" state here was to remove the need for locking the awaitMu mutex before doing the seqBase and seqTop locks.
If we still want there to be no other "resend" call during the time this is going, then we should add that back. But I think this is synchronous for now, right?
why do we need a reset here

Ah my bad, missed removing that. Thanks!

btw - the reason i had the other "init" state here was to remove the need for locking the awaitMu mutex before doing the seqBase and seqTop locks.

Yeah, it's synchronous currently, so we will never call the queue.resend function twice at the same time while it's already resending.
Since we now process the awaited ACK+NACK while we are also resending packets, we need to ensure that we set the syncStateResending while under the same lock with which we also set the awaitedACK + awaitedNACK, which is why I changed this. But like you say, if we add an extra init state to the syncer, we could ensure that two concurrent calls of the queue.resend function aren't allowed. I opted not to add it now though, to not cause extra confusion with unused functionality, but I can change it if you'd prefer :).
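To illustrate the single-lock invariant described here, a sketch extending the earlier syncer sketch with two hypothetical uint8 fields, awaitedACK and awaitedNACK (the modulo-n sequence arithmetic of the real code is omitted for brevity):

```go
// initResend marks the syncer as resending and records the awaited
// ACK/NACK under one lock, so an ACK or NACK processed concurrently
// always observes a consistent triple.
func (s *syncer) initResend(top uint8) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.state = syncStateResending

	// Await the ACK for the last resent packet, or the NACK for
	// the sequence right after it.
	s.awaitedACK = top - 1
	s.awaitedNACK = top
}
```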
gbn/queue.go (Outdated)

```diff
 	if seq == q.sequenceTop {
 		q.sequenceBase = q.sequenceTop
 	}

 	// If we receive the awaited NACK, we shouldn't trigger a new
 	// resend, as we can now proceed to send new packets.
 	return false, false
 }

 // If the NACK is the same as sequenceTop, and we weren't awaiting this
 // NACK as part of the resend catch up, it probably means that queue
 // was sent successfully, but we just missed the necessary ACKs. So we
-// can empty the queue here by bumping the base and we dont need to
+// can empty the queue here by bumping the base and we don't need to
 // trigger a resend.
 if seq == q.sequenceTop {
 	q.sequenceBase = q.sequenceTop
-	return true, false
+	return false, false
 }

 // Is the NACKed sequence even in our queue?
 if !containsSequence(q.sequenceBase, q.sequenceTop, seq) {
 	q.cfg.log.Tracef("NACK seq %d is not in the queue. Ignoring.",
```
It seems the "bumped" logic is changed here only to be changed again in a later commit...
If there is a bug in the original logic - perhaps better to fix that in its own commit before any additions, because these 2 diffs are a bit confusing to follow.
Yeah good idea, cleaned this up with the latest push :)!
Force-pushed from a1ee556 to b101a16.
Thanks a lot for the review @ellemouton! I've updated the commit history quite extensively with the latest push, and addressed your feedback, so I hope this version looks better :)! 🚀
Yeah this was part of why I added the …
Code looks good to me, great work!
But it seems like the server or client disconnect integration tests fail pretty consistently. Do you think that's something that needs to be addressed with different timeouts (e.g. #88), or should they pass reliably for this PR too?
I'm happy to merge this as-is if one of the PRs down the road shows them passing reliably.
Force-pushed from b101a16 to 82af2e2.
@ellemouton: review reminder
I think the tests are timing out since we now don't just keep resending, right? The test that times out is the reconnection test - meaning that packets are getting dropped. So before, we would resend more frequently & now we back off quite a bit. So I think if we are confident that this PR is an improvement on reliability, then we should just increase the itest timeout to whatever the average timeout required by that test is. Looks like 1 min should be fine, right?
In order to ensure reliability, I suggest running a simulation overnight that keeps sending packets back and forth and occasionally does a restart of the aperture server - the test is successful if by the end of the test (after 12 hours or w/e) the client and server are still in sync & no resend loops are occurring.
Force-pushed from 82af2e2 to c95b563.
Ok so I've finally figured out what's going on in the itests! There were 2 core problems: 1 which was caused by a bug in this PR, and 1 which is pre-existing.

Issue 1:

The bug which was caused by this PR mainly affected the … I've addressed this with the latest push!

Issue 2:

This problem is pre-existing, and is not made worse by this PR. It affects the testServerReconnect itest. The core issue is that when we stop the server here, the server sends a FIN packet and then closes the connection.

We then verify that the client's connection status has changed to ClientStatusSessionNotFound. But as there's a sliiiight delay between when the FIN packet is sent and when the connection is closed, 25-50% of the time when running the test, the client receives the FIN packet from the server and calls … before the connection has been closed. When the client then processes the FIN packet, the client will close down and never change the connection status. Therefore the predicate in the test is never satisfied.

For the cases when the test is run and instead actually succeeds, the connection is already closed by the time the FIN packet is processed (lightning-node-connect/mailbox/client_conn.go, line 310 in fa60171).

I'm not 100% sure how we should handle this correctly, so I would appreciate any feedback on how you think this should be handled best! One potential solution that I experimented with would be to trigger the mailbox …

Finally, it's expected that the itests should take a bit longer for the reconnection tests, as we will now not resend as frequently and will back off quite a bit, as @ellemouton points out above. However, I actually don't think we need to increase any timeouts; it's mainly the 2 issues I've detailed above that have been causing problems for the itests 🚀!
This is a great idea, I'll implement that simulation asap and let it run for a while as suggested 🔥!
Really nice investigation @ViktorTigerstrom 🔥 🔥 Regarding how to solve issue 2, could an …
Thanks a lot for the patch @ellemouton! That worked really well 🔥🚀!
I've added it on top of this PR, so itests are now finally consistently passing 🎉!
```go
gbn.WithKeepalivePing(
	gbnClientPingTimeout, gbnPongTimeout,
),
gbn.WithOnFIN(func() {
```
We could potentially add this to the server_conn.go version as well, and force the server to set the connection status to ServerStatusIdle too. But I haven't really seen any case where this has caused any issues on the server side yet, so I opted not to add it. But I will add it to the server side as well if you think I should!
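If we did add it, it could look something like the client snippet above; gbnServerPingTimeout and the setStatus helper here are assumed names for illustration, not code from this PR:

```go
gbn.WithKeepalivePing(
	gbnServerPingTimeout, gbnPongTimeout,
),
gbn.WithOnFIN(func() {
	// Mirror the client-side fix: force the status back to idle
	// so a FIN processed after the peer hangs up still updates
	// the connection state.
	s.setStatus(ServerStatusIdle)
}),
```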
Awesome investigation on the itest flake hunt! Very nice to have things pass so consistently now.
Great work, LGTM 🎉
tACK, LGTM! Great investigations & great work!! 💥 🚀
The majority of my comments are spelling/grammar nits
Force-pushed from 4998324 to 1e9d938.
Thanks a lot for the review @ellemouton!
Force-pushed from 1e9d938 to f41a037.
The processNACK function returns two booleans. The first boolean used to only indicate whether the NACK was part of the queue, but with this commit it is expanded to indicate whether the queue needs to be resent as a result of the NACK. The second boolean indicates whether the queue sequence base was bumped or not.

Previously, the bumped value for processNACK didn't accurately reflect whether the queue sequence base was actually bumped in the cases where we didn't trigger a resend. This commit addresses that issue. Updating the bumped value to be correct has no real impact on the current implementation of the gbn conn, but it might start having an effect if the implementation is modified in the future.
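A simplified sketch of that return contract (the awaited-NACK catch-up branch and the locking are omitted, and the field names follow the diffs above):

```go
// processNACK returns (resend, bumped): whether the queue needs to
// be resent as a result of the NACK, and whether the sequence base
// was bumped.
func (q *queue) processNACK(seq uint8) (bool, bool) {
	// The peer NACKs the sequence right after our queue: everything
	// was received, so empty the queue by bumping the base. The
	// base moved, but no resend is needed.
	if seq == q.sequenceTop {
		q.sequenceBase = q.sequenceTop

		return false, true
	}

	// A NACK outside the queue changes nothing.
	if !containsSequence(q.sequenceBase, q.sequenceTop, seq) {
		return false, false
	}

	// A NACK inside the queue: bump the base to the NACKed
	// sequence and resend from there.
	q.sequenceBase = seq

	return true, true
}
```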
The syncer struct can be used to ensure that both the sender and the receiver are in sync before the waitForSync function is completed. This ensures that we don't end up in a loop where one party keeps resending packets over and over again. This is done by awaiting that we receive either an expected ACK or an expected NACK after resending the queue. The expected ACK is the ACK for the last packet that has been resent, and the expected NACK is the NACK sequence following the last packet that has been resent.

To understand why we need to await the expected ACK/NACK after resending the queue, consider the following scenario:

1. Alice sends packets 1, 2, 3 & 4 to Bob.
2. Bob receives packets 1, 2, 3 & 4, and sends back the respective ACKs.
3. Alice receives ACKs for packets 1 & 2, but due to latency the ACKs for packets 3 & 4 are delayed and aren't received until Alice's resend timeout has passed, which leads to Alice resending packets 3 & 4. Alice will after that receive the delayed ACKs for packets 3 & 4, but will consider them the ACKs for the resent packets, and not the original packets they were actually sent for. If we didn't wait after resending the queue, Alice would then proceed to send more packets (5 & 6).
4. When Bob receives the resent packets 3 & 4, Bob will respond with NACK 5. Due to latency, the packets 5 & 6 that Alice sent in step (3) above will then be received by Bob, and be processed as the correct response to the NACK 5. Bob will after that await packet 7.
5. Alice will receive the NACK 5, and now resend packets 5 & 6. But as Bob is now awaiting packet 7, this send will lead to a NACK 7. Due to latency, if Alice doesn't wait after resending the queue, Alice will proceed to send new packet(s) before receiving the NACK 7.
6. This resend loop would continue indefinitely, so we need to ensure that Alice waits after she has resent the queue, to ensure that she doesn't proceed to send new packets before she is sure that both parties are in sync.

To ensure that we are in sync, after we have resent the queue, we will await that we either:

1. Receive a NACK for the sequence number succeeding the last packet in the resent queue, i.e. in step (3) above, that would be NACK 5. OR
2. Receive an ACK for the last packet in the resent queue, i.e. in step (3) above, that would be ACK 4. After we receive the expected ACK, we will then wait for the duration of the resend timeout before continuing. The reason we wait for the resend timeout before continuing is that the ACKs we are getting after a resend could be delayed ACKs for the original packets we sent, and not ACKs for the resent packets. In step (3) above, the ACKs for packets 3 & 4 that Alice received were delayed ACKs for the original packets. If Alice had immediately continued to send new packets (5 & 6) after receiving the ACK 4, she would have then received the NACK 5 from Bob, which was the actual response to the resent queue. But as Alice had already continued to send packets 5 & 6 when receiving the NACK 5, the resend queue response to that NACK would cause the resend loop to continue indefinitely.

When either of the 2 conditions above is met, we will consider both parties to be in sync, and we can proceed to send new packets.
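A sketch of that wait, with hypothetical channel and field names on the syncer (chan struct{} signals and time.Duration timeouts are assumed): after resending, we block until the expected ACK or NACK arrives, and on the ACK path we additionally wait out one resend timeout in case the ACK was a delayed one for the original packets:

```go
func (s *syncer) waitForSync() {
	select {
	case <-s.cancel:
		// The sync round was aborted by a reset.
		return

	case <-s.expectedNACKSignal:
		// NACK for the sequence after the resent queue: both
		// parties agree on what is missing, so we are in sync.
		return

	case <-s.expectedACKSignal:
		// ACK for the last resent packet. It may be a delayed
		// ACK for the original send, so hold off one resend
		// timeout before releasing new sends, in case the real
		// response to the resend is a NACK.
		select {
		case <-time.After(s.resendTimeout):
		case <-s.cancel:
		}

	case <-time.After(s.timeout):
		// Give up waiting and let the normal resend logic run.
	}
}
```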
When the client receives and processes a FIN packet, we force the connection to status ClientStatusSessionNotFound, as on rare occasions the corresponding server may have time to close the connection before we've processed the FIN packet sent by the server. In that case, if we didn't force a new status, the client would never mark the connection with status ClientStatusSessionNotFound.

This solves a bug for the testServerReconnect itest, as that test would sometimes fail due to the FIN packet being processed without the connection status changing, leading to a test predicate never being satisfied.
Force-pushed from f41a037 to 5e16121.
Fixes #89

This PR addresses the packeting resend bug through a sender side only fix.

This PR ensures that after we have resent the queue, we will wait until we know that both parties are in sync before we continue to send new packets. This ensures that we don't end up in an indefinite resend loop due to latency and delayed NACKs by the other party, which could happen prior to this PR.

To understand why we need to await the expected ACK/NACK after resending the queue, consider the following scenario:

1. Alice sends packets 1, 2, 3 & 4 to Bob.
2. Bob receives packets 1, 2, 3 & 4, and sends back the respective ACKs.
3. Alice receives ACKs for packets 1 & 2, but due to latency the ACKs for packets 3 & 4 are delayed and aren't received until Alice's resend timeout has passed, which leads to Alice resending packets 3 & 4. Alice will after that receive the delayed ACKs for packets 3 & 4, but will consider them the ACKs for the resent packets. If we didn't wait after resending the queue, Alice would then proceed to send more packets (5 & 6).
4. When Bob receives the resent packets 3 & 4, Bob will respond with NACK 5. Due to latency, the packets 5 & 6 that Alice sent in step (3) above will then be received by Bob, and be processed as the correct response to the NACK 5. Bob will after that await packet 7.
5. Alice will receive the NACK 5, and now resend packets 5 & 6. But as Bob is now awaiting packet 7, this send will lead to a NACK 7. Due to latency, if Alice doesn't wait after resending the queue, Alice will proceed to send new packet(s) before receiving the NACK 7.
6. This resend loop would continue indefinitely, so we need to ensure that Alice waits after she has resent the queue, until she is sure that both parties are in sync, before sending new packets.

To ensure that we are in sync, after we have resent the queue, we will await that we EITHER:

1. Receive a NACK for the sequence number succeeding the last packet in the resent queue, i.e. in step (3) above, that would be NACK 5. OR
2. Receive an ACK for the last packet in the resent queue, i.e. in step (3) above, that would be ACK 4. After we receive the expected ACK, we will then wait for the duration of the resend timeout before continuing. The reason we wait is that the ACKs we are getting after a resend could be delayed ACKs for the original packets we sent, and not ACKs for the resent packets. In step (3) above, the ACKs for packets 3 & 4 that Alice received were delayed ACKs for the original packets. If Alice had immediately continued to send new packets (5 & 6) after receiving the ACK 4, she would have then received the NACK 5 from Bob, which was the actual response to the resent queue. But as Alice had already continued to send packets 5 & 6 when receiving the NACK 5, the resend queue response to that NACK would cause the resend loop to continue indefinitely.

When either of the 2 conditions above is met, we will consider both parties to be in sync, and we can proceed to send new packets.
TODO: